INTRODUCTION

Pattern recognition plays a central role in numerically 
oriented remote sensing systems. It provides an automatic 
procedure for deciding to which class any given ground resolu- 
tion element should be assigned. The assignment is made in such 
a manner that on the average correct classification is achieved. 

This information note describes briefly the theoretical basis 
for the pattern-recognition-oriented algorithms used in LARSYS, 
the multispectral data analysis software system developed by 
the Laboratory for Applications of Remote Sensing (LARS).

Figure 1 shows a model of a general pattern recognition 
system. In the LARS context the receptor or sensor is usually 
a multispectral scanner. For each ground resolution element 
the receptor produces n numbers or measurements corresponding to 
the n channels of the scanner. It is convenient to think of 
the n measurements as defining a point in n-dimensional Euclidean 
space which is referred to as the measurement space. Any particular
measurement can be represented by the vector:

    $X = [x_1, x_2, \ldots, x_n]^T$    (1)

Figure 1. A Pattern Recognition System

Figure 2. A Simplified Model of a Pattern Recognition System

The feature extractor transforms the n-dimensional measurement
vector into an n'-dimensional feature vector. In LARSYS,
this consists simply of selecting a subset of the components of
the measurement vector, but much more complex transformations
are possible (see, for example, Ready et al., 1971).

The decision maker in Figure 1 performs calculations on the 
feature vectors presented to it and, based upon a decision rule, 
assigns the "unknown" data point to a particular class. 

For the present, it will be sufficient to simplify the 
model to that shown in Figure 2. The vector X may subsequently 
be referred to as either a measurement vector or a feature vector.


DISCRIMINANT FUNCTIONS: QUANTIFYING THE DECISION PROCEDURE

Patterns arising in remote sensing problems exhibit some 
randomness due to the randomness of nature. As an example, one 
cannot in general expect the vector of measurements corresponding 
to a particular ground resolution element from one part of a 
wheat field to correspond exactly to the vector corresponding to 
a ground resolution element from another part of the field. 
Rather, vectors from the same class tend to form a "cloud" of 
points as shown in Figure 3. The job of the pattern classifier
is to divide the feature space into decision regions, each
region corresponding to a specific class. Any data point falling
in a particular region is assigned to the class associated with
that region. The surfaces separating the decision regions are
known as decision surfaces. Designing a pattern recognizer
really boils down to devising a procedure for determining the
decision surfaces so as to optimize some performance criterion,
such as maximizing the frequency of correct classification.

Figure 4. A Pattern Classifier Defined in Terms
of Discriminant Functions

These concepts can be put on a quantitative basis by introducing
discriminant functions. Assume there are m pattern
classes. Let $g_1(X), g_2(X), \ldots, g_m(X)$ be scalar single-valued
functions of X such that $g_i(X) > g_j(X)$ for all X in the region
corresponding to the i-th class ($j \neq i$). If the discriminant functions
are continuous across the decision boundaries, the decision surfaces
are given by equations of the form

    $g_i(X) - g_j(X) = 0$.    (2)

A pattern classifier can then be represented by the block 
diagram of Figure 4. 

By taking this approach the pattern classifier design 
problem is reduced to the problem of how to select the discrim- 
inant functions in an optimal fashion. 

"TRAINING" THE CLASSIFIER 

In some cases it is possible to select discriminant functions
on the basis of theoretical considerations, experience,
or perhaps even intuition. More commonly the discriminant
functions are based upon a set of training patterns. Training
patterns which are typical of those to be classified are "shown"
to the classifier together with the identity of each pattern, and
based on this information the classifier establishes its discriminant
functions $g_i(X)$, i = 1, 2, ..., m.

Example: Consider a two-dimensional, two-class problem in
which the discriminant functions are assumed to have the form

    $g_1(X) = a_{11} x_1 + a_{12} x_2 + b_1$
    $g_2(X) = a_{21} x_1 + a_{22} x_2 + b_2$    (3)

Then $g_1(X) - g_2(X) = 0$ is the equation of a straight line
dividing the $x_1$, $x_2$ plane. Given a set of training patterns,
how should the constants $a_{11}$, $a_{12}$, $b_1$, etc. be chosen? It can
be proven that if the training patterns are indeed separable
by a straight line, then the following procedure will converge
(Nilsson, 1965):

    Initially select a's and b's arbitrarily. For
    example let

        $a_{11} = a_{12} = b_1 = 1$
        $a_{21} = a_{22} = b_2 = -1$    (4)

    Then take the first training pattern (say it is from $\omega_1$,
    i.e., from class 1) and calculate $g_1(X)$ and $g_2(X)$. If
    $g_1(X) > g_2(X)$ the decision is correct; go on to the next
    training sample. If $g_1(X) < g_2(X)$ a wrong decision would
    be made. In this case alter the coefficients so as
    to increase the discriminant function associated with
    the correct class and decrease the discriminant function
    associated with the incorrect class. If X is from
    $\omega_1$ but $\omega_2$ was decided, let

        $a_{11}' = a_{11} + \alpha x_1$        $a_{21}' = a_{21} - \alpha x_1$
        $a_{12}' = a_{12} + \alpha x_2$        $a_{22}' = a_{22} - \alpha x_2$
        $b_1' = b_1 + \alpha$                  $b_2' = b_2 - \alpha$    (5)

    where $\alpha$ is a convenient positive constant. If X is from
    $\omega_2$ but $\omega_1$ was decided, change the signs in Eq. (5) so as
    to increase $g_2$ and decrease $g_1$. Then go on to the next
    training pattern. Cycle through the training patterns until
    all are correctly classified.

Suggestion: Design and work out a numerical example to
illustrate the training process described above. Assume two
classes, two dimensions, and two training patterns per class.
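
One way to work such an example is sketched below in Python. It is a minimal illustration of the error-correction procedure of Eq. (4) and (5), not the LARSYS implementation. The training patterns, the function name, and the choice to treat a tie between the two discriminants as an error are assumptions made for the illustration.

```python
def train_linear_discriminants(patterns, alpha=1.0, max_cycles=100):
    """Error-correction training of two linear discriminant functions.

    patterns: list of ((x1, x2), label) with label 1 or 2.
    Returns (a1, b1, a2, b2), where g_k(X) = a_k[0]*x1 + a_k[1]*x2 + b_k.
    """
    # Arbitrary starting values, as in Eq. (4).
    a1, b1 = [1.0, 1.0], 1.0
    a2, b2 = [-1.0, -1.0], -1.0
    for _ in range(max_cycles):
        errors = 0
        for (x1, x2), label in patterns:
            g1 = a1[0] * x1 + a1[1] * x2 + b1
            g2 = a2[0] * x1 + a2[1] * x2 + b2
            # Assumption: a tie (g1 == g2) is treated as a wrong decision.
            correct = g1 > g2 if label == 1 else g2 > g1
            if correct:
                continue
            errors += 1
            # Eq. (5): raise the discriminant of the correct class and
            # lower the other one (signs flipped when the pattern is class 2).
            sign = 1.0 if label == 1 else -1.0
            a1[0] += sign * alpha * x1
            a1[1] += sign * alpha * x2
            b1 += sign * alpha
            a2[0] -= sign * alpha * x1
            a2[1] -= sign * alpha * x2
            b2 -= sign * alpha
        if errors == 0:  # a full cycle with no mistakes: training is done
            break
    return a1, b1, a2, b2


# Hypothetical training set: two patterns per class, separable by a line.
training = [((0, 2), 1), ((1, 3), 1), ((2, 0), 2), ((3, 1), 2)]
a1, b1, a2, b2 = train_linear_discriminants(training)
```

With these four patterns a single correction (on the pattern (2, 0)) is enough, and the resulting decision line $g_1(X) - g_2(X) = 0$ is $x_2 = x_1$.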


THE STATISTICAL APPROACH 

Remote sensing is typical of many practical applications of 
pattern recognition for which statistical methods are appropriate 
in the following respects: 

• The data exhibit many incidental variations (noise) which
  tend to obscure differences between the pattern classes.

• There is often uncertainty, however small, concerning the
  true identity of the training patterns.

• The pattern classes of interest may actually overlap in
  the measurement space (may not always be discriminable),
  suggesting the use of an approach which leads to decisions
  which are "most likely" correct.

Statistical pattern recognition techniques, including the
approach to be described here, often make use of the probability
density functions associated with the pattern classes.
However, the density functions are usually unknown and must be
estimated from a set of training patterns. In some cases, the
form of the density functions is assumed and only certain parameters
associated with the functions are estimated. Such methods
are called "parametric." Methods for which not even the form
of the density functions is assumed are called "nonparametric."
The parametric case requires more a priori knowledge or some 
basic assumptions regarding the nature of the patterns. The non- 
parametric case requires less initial knowledge and fewer assump- 
tions but is generally more difficult to implement. 

Let there be m classes characterized by the conditional
probability density functions

    $p(X|\omega_i)$,  i = 1, 2, ..., m.    (6)

The function $p(X|\omega_i)$ gives the probability of occurrence of
pattern X, given that X is in fact from class i.

An important assumption in the LARSYS algorithms is that
the $p(X|\omega_i)$ are each multivariate gaussian (or normal) distributions.
This is a parametric assumption which leads to a form
of classifier which is relatively simple to implement. Under this
assumption, a mean vector and covariance matrix are sufficient to
characterize the probability distribution of any pattern class.

Returning to the problem of how to specify the discriminant
functions, an approach based on statistical decision theory is
taken. A set of loss functions is defined

    $\lambda(i|j)$,  i = 1, 2, ..., m;  j = 1, 2, ..., m    (7)

where $\lambda(i|j)$ is the loss (or cost) incurred when a
pattern is classified into class i when it is actually from
class j.

If the pattern classifier is designed so as to minimize the
average (expected) loss, then the classifier is said to be Bayes
optimal. This is the criterion to be used in specifying the
classification algorithm.

For a given pattern X, the expected loss resulting from
the decision $X \in \omega_i$ is given by

    $L_X(i) = \sum_{j=1}^{m} \lambda(i|j)\, p(\omega_j|X)$    (8)

where $p(\omega_j|X)$ is the probability that a pattern X is from class
j. Applying Bayes' rule, i.e.,

    $p(X, \omega_j) = p(X|\omega_j)\, p(\omega_j) = p(\omega_j|X)\, p(X)$    (9)

the expected loss can be written as

    $L_X(i) = \sum_{j=1}^{m} \lambda(i|j)\, p(X|\omega_j)\, p(\omega_j) / p(X)$    (10)

where $p(\omega_j)$ is the a priori probability of $\omega_j$.

Note that minimizing $L_X(i)$ with respect to i is the same
as maximizing $-L_X(i)$. Thus a suitable set of discriminant
functions is

    $g_i(X) = -L_X(i)$,  i = 1, 2, ..., m.    (11)

A simple (and reasonable) loss function is

    $\lambda(i|j) = 0$,  $i = j$
    $\lambda(i|j) = 1$,  $i \neq j$    (12)

(zero loss for correct classification, unit loss for any error).
Then

    $g_i(X) = -\sum_{j=1, j \neq i}^{m} p(X|\omega_j)\, p(\omega_j) / p(X)$    (13)


Here and at several points later in this paper it will be convenient
to make use of the following fact: from any set of
discriminant functions, another set of discriminant functions can be
formed by taking the same monotonically increasing function of each
of the original discriminant functions. For example, if

    $g_i(X)$,  i = 1, 2, ..., m

is a set of discriminant functions, then so are the sets

    $g_i'(X) = g_i(X) + \text{constant}$,  i = 1, 2, ..., m

and (for positive $g_i$)

    $g_i''(X) = \log[g_i(X)]$,  i = 1, 2, ..., m.


Examining (13), note that p(X) is not a function of i, so it
is just as well to maximize

    $g_i'(X) = -\sum_{j=1, j \neq i}^{m} p(X|\omega_j)\, p(\omega_j) = -[p(X) - p(X|\omega_i)\, p(\omega_i)]$.    (14)

But this is maximum if

    $g_i''(X) = p(X|\omega_i)\, p(\omega_i)$    (15)

is maximum. Thus, the decision rule is:

    Decide $X \in \omega_i$ if and only if

    $p(X|\omega_i)\, p(\omega_i) \geq p(X|\omega_j)\, p(\omega_j)$ for all j.*    (16)

This is commonly referred to as the maximum likelihood decision
rule.

Example: Consider two pairs of dice, one a standard pair
and a second pair with two additional spots on each face. The
probability functions associated with rolling a particular number
with these dice are shown in Figure 5. Note how application of
the decision rule (16) coincides with what you would do intuitively
if the question were asked, "Given that a y was rolled, decide
which pair of dice was used." Let y = 4, 7, 13. Note that
p(standard dice) = p(augmented dice) = 0.5.
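
The dice example can be checked numerically. The following Python sketch enumerates the totals of each pair and applies rule (16) with equal prior probabilities. The function and variable names are illustrative, and ties are resolved in favor of the standard pair, which is one arbitrary choice among those the rule allows.

```python
from fractions import Fraction
from itertools import product


def sum_distribution(faces):
    """Probability of each total when two fair dice with these faces are rolled."""
    faces = list(faces)
    weight = Fraction(1, len(faces) ** 2)
    dist = {}
    for a, b in product(faces, repeat=2):
        dist[a + b] = dist.get(a + b, Fraction(0)) + weight
    return dist


# Standard dice show 1..6; the augmented dice carry two extra spots per face.
DENSITY = {
    "standard": sum_distribution(range(1, 7)),
    "augmented": sum_distribution(range(3, 9)),
}
PRIOR = {"standard": Fraction(1, 2), "augmented": Fraction(1, 2)}


def decide(y):
    """Rule (16): pick the class with the largest p(y | class) * p(class)."""
    scores = {name: DENSITY[name].get(y, Fraction(0)) * PRIOR[name]
              for name in PRIOR}
    # max() returns the first maximal key, so ties go to "standard".
    return max(scores, key=scores.get)


decisions = {y: decide(y) for y in (4, 7, 13)}
```

A total of 4 can only come from the standard pair and a total of 13 only from the augmented pair. A total of 7 is possible with both, but it is three times as likely with the standard pair (6/36 against 2/36).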

Consider the maximum likelihood discriminant function as
it applies to remote sensing. The $p(\omega_i)$ represents the a priori
probability of the i-th class. This can often be estimated.

Taking agricultural crop types as an example, the $p(\omega_i)$ may
be estimated from previous year yields, seed sales records or
statistical reporting service information. The densities $p(X|\omega_i)$,
on the other hand, generally have to be estimated from training
samples.

*Ties (the case of equality in (16)) may be arbitrarily decided
by, say, always deciding $X \in \omega_i$ if $g_i(X) = g_j(X)$ and i > j.

The assumption upon which the classification algorithms
are based is that $p(X|\omega_i)$ is a multivariate gaussian probability
density function. This basic assumption is supported
by the following observations:

a) It is a reasonable model of the natural situation.

b) It results in a computationally simple (therefore
   inexpensive) discriminant function.

c) It works (or try it - you'll like it!).


Examining the maximum likelihood decision criterion in
the one-dimensional gaussian case will serve both as a review
of gaussian density functions and as a means of illustrating
the principles of pattern classification. In this case (e.g.,
one spectral channel)

    $p(x|\omega_i) = \frac{1}{(2\pi)^{1/2}\sigma_i} \exp\left[-\frac{1}{2}\frac{(x-\mu_i)^2}{\sigma_i^2}\right]$    (18)

where $\mu_i = E[x]$ and $\sigma_i^2 = E[(x-\mu_i)^2]$ are the mean and variance
for class i. In practice $\mu_i$ and $\sigma_i^2$ are unknown and must be
estimated from training samples. From statistical theory,

    $\hat{\mu}_i = m_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j$    (19)

    $\hat{\sigma}_i^2 = s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (x_j - m_i)^2$    (20)

($n_i$ = number of training patterns in class i)
are unbiased estimators of the mean and variance. Thus the
estimated density function is

    $\hat{p}(x|\omega_i) = \frac{1}{(2\pi)^{1/2} s_i} \exp\left[-\frac{1}{2}\frac{(x-m_i)^2}{s_i^2}\right]$    (21)



Following the decision theory approach the discriminant
function is

    $g_i(x) = \frac{p(\omega_i)}{(2\pi)^{1/2} s_i} \exp\left[-\frac{1}{2}\frac{(x-m_i)^2}{s_i^2}\right]$    (22)

and since a monotonic function of a discriminant function may
also be used as a discriminant function, we shall take the
logarithm of the previous function to obtain

    $g_i'(x) = \log p(\omega_i) - \frac{1}{2}\log 2\pi - \log s_i - \frac{1}{2}\frac{(x-m_i)^2}{s_i^2}$.    (23)

Since the constant term $-\frac{1}{2}\log 2\pi$ appears in all of the $g_i'(x)$
it may be dropped to yield

    $g_i''(x) = \log p(\omega_i) - \log s_i - \frac{1}{2}\frac{(x-m_i)^2}{s_i^2}$.    (24)

Thus the decision rule becomes:

    Decide $x \in \omega_i$ if and only if

    $\log p(\omega_i) - \log s_i - \frac{1}{2}\frac{(x-m_i)^2}{s_i^2} \geq \log p(\omega_j) - \log s_j - \frac{1}{2}\frac{(x-m_j)^2}{s_j^2}$ for all j.    (25)

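The one-dimensional rule can be sketched in a few lines of Python. The example below estimates $m_i$ and $s_i^2$ with Eq. (19) and (20) and classifies with the discriminant of Eq. (24). The class names, sample values, and equal priors are hypothetical and serve only to make the sketch runnable.

```python
import math


def estimate(samples):
    """Unbiased estimates of the mean and variance, Eq. (19) and (20)."""
    n = len(samples)
    m = sum(samples) / n
    s2 = sum((x - m) ** 2 for x in samples) / (n - 1)
    return m, s2


def discriminant(x, m, s2, prior):
    """Eq. (24): log p(w_i) - log s_i - (x - m_i)^2 / (2 s_i^2)."""
    # log s_i equals 0.5 * log(s_i^2), so the variance can be used directly.
    return math.log(prior) - 0.5 * math.log(s2) - 0.5 * (x - m) ** 2 / s2


def classify(x, training, priors):
    """Rule (25): choose the class whose discriminant value is largest."""
    scores = {}
    for name, samples in training.items():
        m, s2 = estimate(samples)
        scores[name] = discriminant(x, m, s2, priors[name])
    return max(scores, key=scores.get)


# Hypothetical single-channel training samples for two classes.
training = {"wheat": [10.0, 12.0, 11.0, 13.0], "soil": [20.0, 22.0, 19.0, 23.0]}
priors = {"wheat": 0.5, "soil": 0.5}
label_low = classify(12.0, training, priors)
label_high = classify(21.0, training, priors)
```

In a real application the statistics would be computed once per class and stored, rather than recomputed for every data point as in this sketch.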

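The one-dimensional case just described serves to
illustrate the Bayes decision rule for gaussian statistics.
In the two-dimensional case

    $X = [x_1, x_2]^T$    (26)

and

    $p(X|\omega_i) = \frac{1}{2\pi(\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2)^{1/2}} \exp\left\{-\frac{1}{2} \cdot \frac{\frac{(x_1-\mu_{i1})^2}{\sigma_{i11}} - \frac{2\sigma_{i12}(x_1-\mu_{i1})(x_2-\mu_{i2})}{\sigma_{i11}\sigma_{i22}} + \frac{(x_2-\mu_{i2})^2}{\sigma_{i22}}}{1 - \frac{\sigma_{i12}^2}{\sigma_{i11}\sigma_{i22}}}\right\}$    (27)

where

    $\mu_{i1} = E[x_1|\omega_i]$,  $\mu_{i2} = E[x_2|\omega_i]$
    $\sigma_{ijk} = E[(x_j - \mu_{ij})(x_k - \mu_{ik})|\omega_i]$,  j, k = 1, 2;  i = 1, 2, ..., m    (28)
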

This is a formidable expression, but by defining a mean vector
and covariance matrix

    $U_i = \begin{bmatrix} \mu_{i1} \\ \mu_{i2} \end{bmatrix}$    (29)

    $\Sigma_i = \begin{bmatrix} \sigma_{i11} & \sigma_{i12} \\ \sigma_{i21} & \sigma_{i22} \end{bmatrix}$    (30)

the density can be rewritten in the simple form:

    $p(X|\omega_i) = \frac{1}{2\pi|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(X-U_i)^T \Sigma_i^{-1} (X-U_i)\right]$    (31)

where $|\Sigma_i|$ is the determinant of $\Sigma_i$, $\Sigma_i^{-1}$ is its inverse, and
$(X-U_i)^T$ is the transpose of $(X-U_i)$. The beauty of the matrix
formulation is that it holds for n dimensions as well as for
2 dimensions, with the factor $2\pi$ replaced by $(2\pi)^{n/2}$.

For the multivariate gaussian case, the maximum likelihood
discriminant function is given by

    $g_i(X) = p(X|\omega_i)\, p(\omega_i) = \frac{p(\omega_i)}{(2\pi)^{n/2}|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(X-U_i)^T \Sigma_i^{-1} (X-U_i)\right]$    (32)

Taking the log and eliminating the constant term

    $g_i''(X) = \log p(\omega_i) - \frac{1}{2}\log|\Sigma_i| - \frac{1}{2}(X-U_i)^T \Sigma_i^{-1} (X-U_i)$    (33)

The corresponding decision rule is:

    Decide $X \in \omega_i$ if and only if

    $g_i''(X) \geq g_j''(X)$ for all $j \neq i$.    (34)
When $U_i$ and $\Sigma_i$ are not known, they must be estimated from
training patterns. Denoting the estimates as $\hat{U}_i$ and $\hat{\Sigma}_i$ and dropping
the subscripts indicating class to simplify the notation:

    $\hat{U} = M = \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_n \end{bmatrix}$  and  $\hat{\Sigma} = S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nn} \end{bmatrix}$.    (35)

Unbiased estimators are

    $m_j = \frac{1}{n_t}\sum_{t=1}^{n_t} x_{jt}$,  j = 1, 2, ..., n    (36)

    $s_{jk} = \frac{1}{n_t - 1}\sum_{t=1}^{n_t} (x_{jt} - m_j)(x_{kt} - m_k)$,  j = 1, 2, ..., n;  k = 1, 2, ..., n    (37)

where $n_t$ is the number of training patterns.
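
The estimators (36) and (37) and the discriminant function (33) can be combined into a small classifier. The sketch below is restricted to two features so that the matrix inverse can be written in closed form. The training patterns, class names, and priors are hypothetical, and the code is an illustration rather than the LARSYS statistics or classification processor.

```python
import math


def mean_and_covariance(patterns):
    """Eq. (36) and (37) for a list of two-component training patterns."""
    n_t = len(patterns)
    m = [sum(p[j] for p in patterns) / n_t for j in range(2)]
    s = [[sum((p[j] - m[j]) * (p[k] - m[k]) for p in patterns) / (n_t - 1)
          for k in range(2)]
         for j in range(2)]
    return m, s


def discriminant(x, m, s, prior):
    """Eq. (33) with the estimates M and S in place of U_i and Sigma_i."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d0, d1 = x[0] - m[0], x[1] - m[1]
    # Quadratic form (X-M)^T S^{-1} (X-M) using the closed-form 2x2 inverse.
    q = (s[1][1] * d0 * d0 - (s[0][1] + s[1][0]) * d0 * d1
         + s[0][0] * d1 * d1) / det
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * q


def classify(x, training, priors):
    """Rule (34): assign x to the class with the largest discriminant."""
    scores = {}
    for name, patterns in training.items():
        m, s = mean_and_covariance(patterns)
        scores[name] = discriminant(x, m, s, priors[name])
    return max(scores, key=scores.get)


# Hypothetical two-channel training patterns for two classes.
training = {
    "A": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)],
    "B": [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)],
}
priors = {"A": 0.5, "B": 0.5}
m_a, s_a = mean_and_covariance(training["A"])
```

Note that the estimated covariance matrix must be nonsingular, which requires at least n + 1 training patterns per class that do not all lie in a lower-dimensional subspace.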

VECTOR CLASSIFICATION IN LARSYS

The classification algorithm currently in LARSYS is essentially
the decision rule defined by Eq. (33) and (34), except
that all class probabilities are assumed equal; i.e.,

    $p(\omega_1) = p(\omega_2) = \ldots = p(\omega_m) = \frac{1}{m}$.

The required mean vectors and covariance matrices are computed
from training patterns by the statistics processor. The classification
processor computes the $g_i(X)$, i = 1, 2, ..., m, for every
data vector in the area to be classified. For each vector the
class decided and the value of the discriminant function computed
for that class are written on magnetic tape for later use
by the results display processor.

Inevitably there are points in the area classified which 
do not belong to any of the classes defined by the training 
samples. In agricultural settings such points might be from 
roads, fence lines, farmsteads, and the like. The classification 
procedure necessarily assigns these points to one of the train- 
ing classes, but typically they may be expected to yield very 
small discriminant values. The latter fact can be utilized to
detect them, as will now be described.

Consider Figure 6. In this one-dimensional, two-class 
example, the points to be detected are those "not very much 
like" any of the training classes and therefore having a low 
probability of belonging to any of the training classes. Thus 
by "rejecting" or "thresholding" a very small percentage of the 
points actually belonging to the training classes, it is pos- 
sible to reject a relatively large number of points not belonging 
to the training classes. This can be done simply by computing 
the probability density value associated with the data vector
and "rejecting" the point if the value is below a user-specified
threshold.

[Figure: density $p(x|\omega)$ with the "rejects" regions marked]


But this can be accomplished just as well using the discriminant
values stored as part of the classification result. If X
is n-dimensional and normally distributed then the quadratic
form

    $(X-U_i)^T \Sigma_i^{-1} (X-U_i)$    (38)

has a chi-square distribution with n degrees of freedom
($C_n(\chi^2)$). Therefore to threshold, say, P percent of the normal
distribution shown in Figure 7a, it is just as well to threshold
P percent of the chi-square distribution of $(X-U_i)^T \Sigma_i^{-1} (X-U_i)$.

This quadratic form is related to the discriminant function $g_i(X)$
of Eq. (33) in the following manner:

    $(X-U_i)^T \Sigma_i^{-1} (X-U_i) = -2g_i(X) + 2b_i$    (39)

where

    $b_i = \log p(\omega_i) - \frac{1}{2}\log|\Sigma_i|$    (40)

Thus, every point for which

    $-2g_i(X) + 2b_i > (\chi^2 \text{ for which } C_n(\chi^2) = P/100)$    (41)

is rejected or thresholded. Note that a different threshold
value may be applied to each class.
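
For two features the chi-square threshold has a closed form, which makes the rule of Eq. (39) to (41) easy to illustrate. The Python sketch below assumes n = 2 and reads $C_n(\chi^2)$ as the upper-tail probability, so that P percent of the points truly belonging to a class are rejected; both are assumptions of this illustration, as are the function names. For other n a chi-square table or a statistics library would be needed.

```python
import math


def chi_square_threshold_2d(percent_rejected):
    """Chi-square value (2 degrees of freedom) exceeded with the given percentage.

    For n = 2 the upper-tail probability is exp(-chi2 / 2), so the
    threshold follows by solving exp(-chi2 / 2) = P / 100.
    """
    return -2.0 * math.log(percent_rejected / 100.0)


def is_rejected(g, log_prior, log_det, percent_rejected):
    """Apply Eq. (39)-(41) to a stored discriminant value g (two features)."""
    b = log_prior - 0.5 * log_det        # Eq. (40)
    quadratic_form = -2.0 * g + 2.0 * b  # Eq. (39)
    return quadratic_form > chi_square_threshold_2d(percent_rejected)


# Hypothetical class with prior 0.5 and |Sigma| = 1, thresholded at 5 percent.
b = math.log(0.5)
far_point = is_rejected(b - 5.0, math.log(0.5), 0.0, 5.0)   # quadratic form 10
near_point = is_rejected(b - 0.5, math.log(0.5), 0.0, 5.0)  # quadratic form 1
```

Because only the stored discriminant value and the per-class constant $b_i$ are needed, the threshold can be changed after classification without reprocessing the data vectors.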

FEATURE SELECTION 

Problem: Given a set of N features (e.g., multispectral
scanner channels), find a subset consisting of n channels which
provides an optimal trade-off between classification costs
(complexity and time for computation) and classification accuracy.

Ideally, one would like to solve this problem by computing
the probability of misclassification associated with each
n-feature subset and then selecting the one giving best performance.
However, it is generally not feasible to perform the
required computations. Even under the simplifying assumption
of normal statistics, numerical integration is required which,
in the multidimensional case, is impractical to carry out. To
see this, consider that

    $\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$    (42)

subsets of features must be evaluated. Thus, for example, to
select the best 4 out of 12 available features requires

    $\binom{12}{4} = \frac{12!}{4!\,8!} = 495$    (43)

integrations in 4-dimensional space. Even on the fastest
computers, such computations would be prohibitive. Alternative
methods must be found for feature selection.
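
The count in Eq. (43) is easy to confirm by enumeration. The short Python sketch below lists every 4-channel subset of 12 channels; the channel numbering 1 to 12 is an assumption made for the illustration.

```python
from itertools import combinations
from math import factorial

N, n = 12, 4
channels = range(1, N + 1)  # hypothetical channel numbers 1..12

# Every candidate feature subset that an exhaustive search would evaluate.
subsets = list(combinations(channels, n))

# Eq. (42): N! / (n! (N - n)!)
count = factorial(N) // (factorial(n) * factorial(N - n))
```

Each of the `subsets` would require its own error-probability integration, which is why a cheaper separability measure is sought.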

From Figure 8, the probability of error (proportional to the
shaded area) can be seen to be a function of the "normalized
distance" between the classes. That is, the error depends upon
both the distance between the means and the variance of each
class. The greater the "distance" the smaller the probability
of error.

Figure 8. Classification Error Depends on Distance
Between Means and on Variance.

One measure of the distance between classes is known as
divergence. Divergence is defined in terms of the likelihood ratio

    $L_{ij}(X) = \frac{p(X|\omega_i)}{p(X|\omega_j)}$    (44)



which is a measure or indication of the separability of the
densities at X. The logarithm of the likelihood ratio provides
an equivalent indication of the separability of the densities:

    $L'_{ij}(X) = \log L_{ij}(X) = \log p(X|\omega_i) - \log p(X|\omega_j)$    (45)

Divergence is defined* as

    $D(i,j|c_1, c_2, \ldots, c_n) = E[L'_{ij}(X)|\omega_i] - E[L'_{ij}(X)|\omega_j]$    (46)

for channels $c_1, c_2, \ldots, c_n$, where

    $E[L'_{ij}(X)|\omega_i] = \int L'_{ij}(X)\, p(X|\omega_i)\, dX$    (47)

Divergence has the following properties:    (48)

1) $D(i,j|c_1, \ldots, c_n) > 0$ for non-identical distributions

2) $D(i,i|c_1, \ldots, c_n) = 0$

3) $D(i,j|c_1, \ldots, c_n) = D(j,i|c_1, \ldots, c_n)$

4) Divergence is additive for independent features:

       $D(i,j|c_1, c_2, \ldots, c_n) = \sum_{k=1}^{n} D(i,j|c_k)$

5) Adding new features never decreases the divergence, i.e.,

       $D(i,j|c_1, \ldots, c_n) \leq D(i,j|c_1, \ldots, c_n, c_{n+1})$

*See for instance Kullback, 1959.

Divergence is defined for any two density functions. In the
case of normal variables with unequal covariance matrices, it can
be shown that

    $D(i,j|c_1, \ldots, c_n) = \frac{1}{2}\,\mathrm{tr}[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})] + \frac{1}{2}\,\mathrm{tr}[(\Sigma_i^{-1} + \Sigma_j^{-1})(U_i - U_j)(U_i - U_j)^T]$    (49)

where tr[A] (trace A) is the sum of the diagonal elements of A.

Divergence is a measure of the dissimilarity of two distri- 
butions and thus provides an indirect measure of the ability 
of the classifier to discriminate successfully between them. 
Computation of this measure for n-tuples of the available 
features provides a basis for selecting an optimal set of n
features.
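
For a single channel the matrices in the gaussian divergence formula reduce to scalars, which gives a compact way to experiment with the measure. The Python sketch below assumes the standard two-term form of Eq. (49), a covariance term $\frac{1}{2}\mathrm{tr}[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})]$ plus a mean term $\frac{1}{2}\mathrm{tr}[(\Sigma_i^{-1} + \Sigma_j^{-1})(U_i - U_j)(U_i - U_j)^T]$, and uses property 4 to add channels that are taken to be independent. Function names and numbers are illustrative.

```python
def divergence_1d(m_i, v_i, m_j, v_j):
    """Single-channel gaussian divergence for means m and variances v."""
    covariance_term = 0.5 * (v_i - v_j) * (1.0 / v_j - 1.0 / v_i)
    mean_term = 0.5 * (1.0 / v_i + 1.0 / v_j) * (m_i - m_j) ** 2
    return covariance_term + mean_term


def divergence_independent(stats_i, stats_j):
    """Property 4: sum per-channel divergences of independent channels.

    stats_i, stats_j: lists of (mean, variance), one entry per channel.
    """
    return sum(divergence_1d(mi, vi, mj, vj)
               for (mi, vi), (mj, vj) in zip(stats_i, stats_j))


same = divergence_1d(0.0, 1.0, 0.0, 1.0)     # identical densities
shifted = divergence_1d(0.0, 1.0, 2.0, 1.0)  # equal variance, means 2 apart
spread = divergence_1d(0.0, 1.0, 0.0, 4.0)   # equal means, unequal variance
two_channels = divergence_independent([(0.0, 1.0), (0.0, 1.0)],
                                      [(2.0, 1.0), (0.0, 4.0)])
```

The third case shows that two classes with identical means still have nonzero divergence when their variances differ, which a distance between means alone would miss.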

Divergence is defined for two distributions. Remote sens- 
ing problems usually involve m > 2 classes. Several strategies 
have been suggested and used for feature selection in the multi- 
class case. 

One strategy is to compute the average divergence over
all pairs of classes and select the subset of features for
which the average divergence is maximum. That is, maximize with
respect to all n-tuples

    $D_{AVE}(c_1, c_2, \ldots, c_n) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} D(i,j|c_1, c_2, \ldots, c_n)$    (50)

While this strategy is certainly reasonable there is no guarantee
that it is optimal. It must be used with care. For instance, a
single pairwise divergence, i.e., a single term in (50), if it
were large enough, could make the average very large. This is
illustrated in Figure 9. So in the process of ranking feature
combinations by $D_{AVE}$ it is a good idea to examine each of the
pairwise divergences as well.

Another strategy is to maximize the minimum pairwise divergence,
i.e., to select the feature combination which does the
best job of separating the hardest-to-separate pair of classes.
This is not a Bayesian (minimum risk) strategy, but it is certainly
a reasonable strategy for many remote sensing problems.

The problem illustrated in Figure 9 is amplified by the
following fact: As the separability of a pair of classes
increases, the pairwise divergence also increases without limit
but the probability of correct classification saturates at 100
percent (see Figure 10). A modified form of the divergence,
referred to as the "transformed divergence," $D_T$, has a
behavior more like probability of correct classification:

    $D_T = 2[1 - \exp(-D/8)]$    (51)

where D is the divergence discussed above. The saturating
behavior of this function (see Figure 10) reduces the effects
of widely separated classes when taking the average over all
pairwise separations. $D_{AVE}$ based on transformed divergence
has been found a much more reliable criterion for feature
selection than the $D_{AVE}$ based on "ordinary" divergence.
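
The effect of the saturating transformation on feature ranking can be seen in a small numerical sketch. The Python code below compares two hypothetical feature subsets for a three-class problem using the average of Eq. (50), the minimum pairwise divergence, and the average of the transformed divergence of Eq. (51). The pairwise divergence values are invented for the illustration.

```python
import math


def transformed(d):
    """Transformed divergence, Eq. (51)."""
    return 2.0 * (1.0 - math.exp(-d / 8.0))


def average_pairwise(pairwise, transform=None):
    """Eq. (50): mean over all class pairs, optionally after a transform."""
    values = [transform(d) if transform else d for d in pairwise.values()]
    return sum(values) / len(values)


def minimum_pairwise(pairwise):
    """Separation of the hardest-to-separate pair of classes."""
    return min(pairwise.values())


# Hypothetical pairwise divergences D(i, j) for two candidate feature subsets.
subset_a = {(1, 2): 400.0, (1, 3): 2.0, (2, 3): 2.0}   # one pair dominates
subset_b = {(1, 2): 30.0, (1, 3): 30.0, (2, 3): 30.0}  # uniformly separable

plain_a = average_pairwise(subset_a)
plain_b = average_pairwise(subset_b)
trans_a = average_pairwise(subset_a, transformed)
trans_b = average_pairwise(subset_b, transformed)
```

Ordinary average divergence ranks subset A first because of its single very large term, while both the transformed average and the minimum pairwise divergence prefer subset B, in which every pair of classes is reasonably separable.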
Although $D_{AVE}$ may be larger in (a), overall classification
accuracy may be better for the situation in (b).

Figure 9. A Disadvantage of $D_{AVE}$.

Figure 10. Relationship of Separability and
(a) Probability of Correct Classification,
(b) Divergence, (c) Transformed Divergence

CLUSTERING

Clustering is a data analysis technique by which one
attempts to determine the "natural" or "inherent" relationships
in a set of observations or data points. It is sometimes referred
to as unsupervised classification because the end product
is generally a classification of each observation into a "class"
which has been established by the analysis procedure, based on
the data, rather than by the person interested in the analysis.

To get an intuitive idea of what is meant by natural or 
inherent relationships in a set of data, consider the examples 
shown in Figure 11. If one were to plot height versus weight 
for a random sampling of students, without regard to sex, on a 
college campus, it is likely that two relatively distinct clusters 
of observations would result, one corresponding to the men in 
the sample (heavier and taller) and another corresponding to the 
women (lighter and shorter). Similarly, if the spectral reflec- 
tance of vegetation in a visible wave band were plotted against 
reflectance in an infrared wave band, dry vegetation and green 
vegetation could be expected to form discernible clusters. 

If the data of interest never involved more than two
attributes (measurements or dimensions), cluster analysis
might always be performed by visual evaluation of two-dimensional
plots such as those in Figure 11. But beyond two or possibly
three dimensions, visual analysis is impossible. For such cases,
it is desirable to have a computer perform the cluster
analysis and report the results in a useful fashion.

Figure 11. Examples of Data Clusters

Why is clustering a useful analysis tool? Clustering has
been applied as a means of data compression (e.g., for transmission
or storage) and for the purpose of determining differentiating
characteristics in complex data sets (e.g., in numerical
taxonomy). An increasingly important application is unsupervised
classification, in which the clustering algorithm determines
the classes based on the clustering tendencies in the data.
The results of such a classification are useful if the "cluster
classes" can be interpreted as classes of interest to the data
analyst.

With respect to LARSYS, the greatest use of cluster analysis 
has been for the purpose of assuring that the data used to 
characterize the pattern classes do not seriously violate the 
assumption of gaussian statistics. In general it may be expected 
that each distinct cluster center will correspond to a mode in 
the distribution of the data. Therefore, by defining a pattern 
subclass for each cluster center, the possibility of multimodal 
(and hence definitely non-gaussian) class distributions is 
essentially eliminated. 

The reader interested in the many possible ways of defining 
clustering in quantitative terms may consult the references 
(Wacker and Landgrebe, 1971; Hall, 1965). Essentially, the 
definition of a clustering algorithm depends on the specification
of two distance measures: a measure of distance between data
points or individual observations; and a measure of distance
between groups of observations. Figure 12 is a block diagram
for a typical clustering algorithm (including the LARSYS 
algorithm). The point-to-point distance measure is used in the 
step labelled "Assign each vector to nearest cluster center." 
The distance between groups of points (clusters, in this case) 
is calculated in the step "Compute separability information." 

Euclidean distance, the most familiar point-to-point distance
measure, is defined for two n-dimensional points or
vectors X and Y as follows:

    Euclidean distance: $D = \left[\sum_{i=1}^{n} (x_i - y_i)^2\right]^{1/2}$    (52)

Several alternatives are available as candidate measures
of distance between clusters, each having its peculiar advantages
and disadvantages. One possibility is the
divergence or transformed divergence used for feature selection.
In LARSYS, a measure called "Swain-Fu distance" has been implemented,
which compares the separation of cluster centers to the
dispersion of the data in the clusters. The dispersion of the
data in a cluster is measured in terms of the "ellipsoid of
concentration" associated with the cluster.

Ellipsoid of concentration: Let the random vector X have
a distribution with mean vector U and covariance matrix $\Sigma$.
If Z is another random vector uniformly distributed over the
volume of the ellipsoid given by

    $Q(Z) = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{|\Sigma_{ij}|}{|\Sigma|}(z_i - u_i)(z_j - u_j) = n + 2$    (53)

where n is the number of components in Z, and $|\Sigma_{ij}|$ is the
cofactor of $\sigma_{ij}$ in $\Sigma$, then Z also has mean U and covariance matrix
$\Sigma$. The ellipsoid Q is called the ellipsoid of concentration
of the distribution of X.

Figure 12. Clustering Algorithm

Q as given by equation (53) is the ellipsoid of concentration
of any distribution with mean U and covariance $\Sigma$ and in particular
serves as a geometrical characterization of the concentration
(or equivalently, the dispersion) of these distributions.

Consider two clusters and their respective ellipsoids of
concentration as shown in Figure 13. $D_{12}$ is the distance between
the cluster centers. $D_1$ is the distance from the center of
cluster 1 to the surface of its ellipsoid of concentration along
the line connecting the cluster centers. Similarly $D_2$ is the
distance from the center of cluster 2 to the surface of its
ellipsoid of concentration along the line connecting the cluster
centers. In terms of these distances, $D_1$, $D_2$, $D_{12}$, the Swain-Fu
distance is given by

    $\Delta = \frac{D_{12}}{D_1 + D_2}$    (54)

In terms of the cluster centers (cluster means) and the covariance
matrices associated with the clusters, the Swain-Fu distance can
be expressed as

    $\Delta = \frac{\sqrt{C_1 C_2}}{\sqrt{C_1} + \sqrt{C_2}}$    (55)

where

    $C_k = \frac{1}{n+2}\,\mathrm{tr}\{\Sigma_k^{-1}(U_1 - U_2)(U_1 - U_2)^T\}$,  k = 1, 2
    tr{A} = trace of matrix A
    $\Sigma_k$ = covariance matrix for cluster k
    $U_k$ = mean vector for cluster k.

Rule (distinctness): Clusters 1 and 2 as given above are
considered distinct provided $\Delta > T$, where T is a suitable threshold.

Empirically, it is observed that two clusters for which $\Delta$
is greater than 0.75 will generally exhibit a multimodal distribution
if pooled as a single class.
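
The distinctness test can be sketched numerically for two features. The Python code below assumes that the Swain-Fu distance is the ratio $D_{12}/(D_1 + D_2)$ and that the ellipsoid of concentration of cluster k is the surface $(X - U_k)^T \Sigma_k^{-1} (X - U_k) = n + 2$; under these assumptions the ratio reduces to $\sqrt{C_1 C_2}/(\sqrt{C_1} + \sqrt{C_2})$ with $C_k = \mathrm{tr}\{\Sigma_k^{-1}(U_1 - U_2)(U_1 - U_2)^T\}/(n+2)$. The function names, the guard for coincident centers, and the cluster statistics are illustrative.

```python
import math


def quadratic_form_2d(d, cov):
    """d^T cov^{-1} d for a 2-vector d and a 2x2 covariance matrix."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return (cov[1][1] * d[0] ** 2 - (cov[0][1] + cov[1][0]) * d[0] * d[1]
            + cov[0][0] * d[1] ** 2) / det


def swain_fu_distance(u1, cov1, u2, cov2):
    """Separation of two cluster centers relative to the cluster dispersions."""
    n = 2  # number of features in this sketch
    d = (u1[0] - u2[0], u1[1] - u2[1])
    c1 = quadratic_form_2d(d, cov1) / (n + 2)
    c2 = quadratic_form_2d(d, cov2) / (n + 2)
    if c1 == 0.0 or c2 == 0.0:  # coincident centers: no separation at all
        return 0.0
    return math.sqrt(c1 * c2) / (math.sqrt(c1) + math.sqrt(c2))


def are_distinct(u1, cov1, u2, cov2, threshold=0.75):
    """Distinctness rule with the empirical threshold quoted in the text."""
    return swain_fu_distance(u1, cov1, u2, cov2) > threshold


identity = [[1.0, 0.0], [0.0, 1.0]]
touching = swain_fu_distance((0.0, 0.0), identity, (4.0, 0.0), identity)
close = swain_fu_distance((0.0, 0.0), identity, (2.0, 0.0), identity)
```

With unit covariance in two dimensions each ellipsoid has radius 2, so centers 4 units apart give a distance of exactly 1 (the ellipsoids just touch), and centers 2 units apart give 0.5, below the 0.75 threshold.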

An illustration will provide some insight as to how the 
algorithm implemented in LARSYS produces clusters from a mass of 
data (refer to Figures 12 and 14). The first step is to select 
initial cluster centers. The analyst must specify how many 
clusters are to be isolated; the algorithm determines (arbitrarily) 
where the initial centers are to be located (the final results 
are relatively insensitive to the initial selection). Each data 
point is then labelled as "belonging" to the nearest cluster 
center (using Euclidean distance) , effectively creating a cluster 
of data points associated with each center. The boundaries between 
clusters are formed by the lines (planes in n-dimensional space) 
which are the perpendicular bisectors of the lines connecting the 
centers. Next, new cluster centers are calculated. The new center 
for each cluster is the mean (in general, mean vector) of all 
points just assigned to that cluster. A check is made to see 
whether the algorithm has achieved the final result, which is the 
case when the new cluster centers are identical with the previous 
centers (or, equivalently, if no data points have changed their 
cluster "allegiance"). If necessary, the data points are assigned 
to the nearest new cluster center, and the process is cycled 
repeatedly. When no further change is detected, the pairwise 
distances (Swain-Fu distance) between the resulting clusters are
computed and all results are printed for evaluation by the
analyst. These results include maps showing the final cluster
assignments of all points in the area(s) analyzed, and all
pairwise distances between clusters. The analyst must decide
which of the resulting clusters are distinct and which should be
pooled to define the classes for the maximum likelihood pattern
recognition analysis.

Figure 14. A Sequence of Clustering Iterations: (a) Initial
Cluster Centers; (b), (c) Intermediate Steps; (d) Final Center
Configuration.
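
The iteration just described can be sketched in Python with NumPy as
follows. This is a minimal illustration and not the LARSYS code. The
function name cluster, the max_iter safeguard, and the handling of a
cluster that loses all its points are assumptions of the sketch. The
initial centers are passed in by the caller here, whereas LARSYS
selects them itself.

```python
import numpy as np


def cluster(points, initial_centers, max_iter=100):
    """Alternate nearest-center assignment and mean update until stable."""
    x = np.asarray(points, dtype=float)                 # shape (N, n)
    centers = np.array(initial_centers, dtype=float)    # shape (k, n), copied
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                   # no point changed its cluster "allegiance"
        labels = new_labels
        for k in range(len(centers)):
            members = x[labels == k]
            if len(members) > 0:    # assumption: an empty cluster keeps its center
                centers[k] = members.mean(axis=0)
    return centers, labels
```

After convergence, the mean vector and covariance matrix of the points
in each cluster would be computed so that the pairwise Swain-Fu
distances can be evaluated.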

SAMPLE CLASSIFICATION 

Sample classification is a slight generalization of a concept
which has been referred to in agricultural contexts as "per-field
classification." In per-field classification, a statistical
characterization of the data points in a field (actually, any 
rectangular area on the ground) is calculated and compared 
against the statistical characterizations of the pattern classes. 
Then the field (i.e., the aggregate of points in the field) is 
classified as a single unit. This is in contrast to the point- 
by-point classification method discussed previously in which each 
observation is given a classification which is assigned inde- 
pendently of all other observations. In sample classification an 
aggregate of data points is characterized and classified as in 
per-field classification except that the data points need not 
necessarily be taken from a spatially contiguous area (i.e., need 
not comprise a field). The only requirement is that the data 
points must all be assumed to be from the same class -- thus 
comprising a sample from a single population, in statistical terms. 

The sample classification approach has some significant
potential advantages over the more conventional point classification.
Essentially, the decision process has at hand more information on
which to base each classification decision, since it utilizes more 
than a single observation. The sample classification algorithm 
in LARSYS computes the sample mean and the sample covariance 
matrix for the data to be classified. The averaging process tends 
to eliminate the effects of system noise and other irrelevant 
variability in the data. The sample covariance matrix and the
class covariance matrices serve on one hand to provide
appropriate factors for weighting the difference between the sample 
mean and each class mean; on the other hand, they may contain 
information which is important in itself for characterizing the 
pattern classes of interest and associating the sample with the 
appropriate class. An example of the latter phenomenon has been 
observed in analyzing flightlines containing both corn fields 
and forested areas. The average reflectance of the forest may 
be very much like the average reflectance of corn -- in fact, 
single observations from each may be very nearly identical.
However, the spectral variability of forest cover is typically
much greater than that of corn and this is reflected in the 
covariance matrices. As a result, the sample classifier can per- 
form much more accurately than the point classifier in discrimin- 
ating between corn and forest. 

It should be clear to the reader from the preceding example 
that the sample classification approach is more powerful than an 
approach which would classify all points on an individual basis 
and then classify "fields" according to "majority rules." 
Formally, the sample classification procedure may be
defined as follows:

Let d(·,·) be a measure defining the distance between
two probability density functions and let {p(X|ω_i),
i = 1, 2, ..., m} be a set of probability density functions
corresponding to the classes ω_1, ω_2, ..., ω_m. If
{X} is a sample (a set of observations) with estimated
probability density p(X|ω_x), then:

    Decide {X} \in \omega_i if and only if
    d[p(X|\omega_x), p(X|\omega_i)] \le d[p(X|\omega_x), p(X|\omega_j)]
    for all j = 1, 2, ..., m.

The concept of distance between probability density functions 
is the same as that discussed earlier with respect to feature 
selection. In fact, the same distance measure could be used, 
although a different distance measure, called Jeffries-Matusita 
distance (see Wacker and Landgrebe, 1971) has been implemented 
in LARSYS. 

For writing the definition of Jeffries-Matusita distance 
(JM distance), it is convenient to use an abbreviated notation 
for the density functions. Let 

    p_i(X) = p(X|\omega_i).

Then the JM distance between density functions p_1(X) and p_2(X)
is given by

    d[p_1(X), p_2(X)] = \left[ \int_X \left( \sqrt{p_1(X)} - \sqrt{p_2(X)} \right)^2 dX \right]^{1/2}     (56)

where the integral is over the entire multi-dimensional space of
X. By defining

    \rho(p_1, p_2) = \int_X \sqrt{p_1(X)} \, \sqrt{p_2(X)} \, dX     (57)

the JM distance can be expressed as

    d[p_1(X), p_2(X)] = [2(1 - \rho(p_1, p_2))]^{1/2}.     (58)
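
The step from Eq. (56) to Eq. (58) may be filled in as follows. It
uses only the fact that each density function integrates to one over
the space of X:

```latex
\begin{aligned}
d^2[p_1(X), p_2(X)]
  &= \int_X \left( \sqrt{p_1(X)} - \sqrt{p_2(X)} \right)^2 dX \\
  &= \int_X p_1(X)\,dX + \int_X p_2(X)\,dX
     - 2 \int_X \sqrt{p_1(X)\,p_2(X)}\,dX \\
  &= 1 + 1 - 2\,\rho(p_1, p_2) \\
  &= 2\,\bigl(1 - \rho(p_1, p_2)\bigr).
\end{aligned}
```

Since ρ lies between 0 and 1, the JM distance lies between 0 (identical
density functions) and the square root of 2 (density functions with no
overlap).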

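In the case of Gaussian distributions with class mean vectors
U_i, covariance matrices Σ_i, and a sample with mean U_x and covariance
matrix Σ_x, Eq. (58) can be written in the form

    d[p_x(X), p_i(X)] = \left\{ 2 \left[ 1 - \rho(p_x, p_i) \right] \right\}^{1/2}     (59)

with

    \rho(p_x, p_i) =
        \left[ \frac{ |\Sigma_x^{-1}|^{1/2} \, |\Sigma_i^{-1}|^{1/2} }
                    { \left| \tfrac{1}{2} (\Sigma_x^{-1} + \Sigma_i^{-1}) \right| } \right]^{1/2}
        \exp\Bigl\{ -\tfrac{1}{4} \Bigl[ U_x^T \Sigma_x^{-1} U_x + U_i^T \Sigma_i^{-1} U_i
        - (\Sigma_x^{-1} U_x + \Sigma_i^{-1} U_i)^T (\Sigma_x^{-1} + \Sigma_i^{-1})^{-1}
          (\Sigma_x^{-1} U_x + \Sigma_i^{-1} U_i) \Bigr] \Bigr\}.
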

It is significant that this expression can be evaluated without 
performing explicit integration. 

In practice the U's and Σ's are usually not known, and
estimates are used which are obtained from training patterns and
from the sample to be classified.
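
A minimal Python sketch of a sample classifier along these lines is
given below, using NumPy. It is an illustration under stated
assumptions and not the LARSYS implementation. The function names
jm_distance_gaussian and classify_sample and the class_stats dictionary
are hypothetical. The quantity ρ is computed from the standard closed
form for two Gaussian densities (the Bhattacharyya coefficient), so no
integration is performed.

```python
import numpy as np


def jm_distance_gaussian(mean_a, cov_a, mean_b, cov_b):
    """JM distance between two Gaussian densities, in closed form."""
    ua = np.asarray(mean_a, dtype=float)
    ub = np.asarray(mean_b, dtype=float)
    sa = np.atleast_2d(np.asarray(cov_a, dtype=float))
    sb = np.atleast_2d(np.asarray(cov_b, dtype=float))
    s_avg = 0.5 * (sa + sb)
    diff = ua - ub
    # rho = exp(-b), where b is the Bhattacharyya exponent.
    b = 0.125 * (diff @ np.linalg.solve(s_avg, diff))
    _, logdet_avg = np.linalg.slogdet(s_avg)
    _, logdet_a = np.linalg.slogdet(sa)
    _, logdet_b = np.linalg.slogdet(sb)
    b += 0.5 * (logdet_avg - 0.5 * (logdet_a + logdet_b))
    rho = np.exp(-b)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(0.0, 2.0 * (1.0 - rho))))


def classify_sample(sample, class_stats):
    """Assign a whole sample to the class at minimum JM distance.

    class_stats maps a class name to (mean vector, covariance matrix),
    as would be estimated from training patterns.
    """
    x = np.asarray(sample, dtype=float)                 # shape (N, n)
    u_x = x.mean(axis=0)                                # sample mean
    s_x = np.atleast_2d(np.cov(x, rowvar=False))        # sample covariance
    distances = {
        name: jm_distance_gaussian(u_x, s_x, u, s)
        for name, (u, s) in class_stats.items()
    }
    return min(distances, key=distances.get), distances
```

With two hypothetical classes having the same mean but very different
covariance matrices, as in the corn and forest example above, a sample
drawn from the high-variability class lies much closer in JM distance
to that class, even though a single observation could not separate the
two.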


CONCLUDING REMARKS 

The foregoing is a description of the theoretical foundations 
of LARSYS, an approach to multispectral data analysis through 
pattern recognition and related computer-oriented techniques. 

The state-of-the-art of machine-assisted remote sensing data 
analysis is changing rapidly as more powerful methods are sought 
to meet ever-more-challenging remote sensing problems. It may
be expected, however, that unless some radically different approach 
is developed which proves more effective, the techniques treated 
herein will continue to be extensively applied. The reader who 
can take time to develop a working understanding of this material 
will be well equipped to apply pattern recognition techniques 
to remote sensing data and to interpret with insight the analysis 
results he obtains. 